3.6 Asymptotic normality of M-estimates. First, let's note some of the conditions
under which nonlinear functions of sample averages are asymptotically normal. Let $f$ be
a function from an open interval containing a point $\mu$ into $\mathbb{R}$. Suppose the derivative
$f'(\mu)$ exists and is not 0. Let $X_1, X_2, \ldots$ be i.i.d. variables with mean $\mu$ and variance
$0 < \sigma^2 := \sigma^2(X_1) < \infty$. Let $S_n := X_1 + \cdots + X_n$, and $\bar X_n := S_n/n$. Then
$|\bar X_n - \mu| = O_p(n^{-1/2})$ by Chebyshev's inequality, and

$f(\bar X_n) = f(\mu) + f'(\mu)(\bar X_n - \mu) + o(|\bar X_n - \mu|)$, so

$n^{1/2}(f(\bar X_n) - f(\mu)) = f'(\mu)\,((S_n - n\mu)/n^{1/2}) + o_p(1)$.

Thus the distribution of the left side converges to $N(0, f'(\mu)^2\sigma^2)$ by the central limit
theorem. This kind of reasoning is known as the "delta-method." To extend the method
to vector-valued random variables, let $f$ be a real-valued function on an open set $U \subset \mathbb{R}^k$.
Then $f$ is said to be Fréchet differentiable at a point $t \in U$ if there is a vector
$v := f'(t) \in \mathbb{R}^k$ such that

$f(u) = f(t) + v \cdot (u - t) + o(|u - t|)$

as $u \to t$. For $k = 1$, this is equivalent to the usual derivative. For $k > 1$, the components
of $f'(t)$ will be the partial derivatives $\partial f(u)/\partial u_i|_{u=t}$, forming the gradient of $f$ at $t$. Each
partial derivative is a directional derivative in the direction of a coordinate axis. Existence
of the Fréchet derivative means not only that these partial derivatives exist, but that the graph
of $f$ has a tangent hyperplane at $(t, f(t)) \in \mathbb{R}^{k+1}$; see Problem 2.

Using the central limit theorem in $\mathbb{R}^k$, the delta-method extends straightforwardly to
$\mathbb{R}^k$-valued random variables having finite second moments.
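The one-dimensional delta-method described at the start of this section can be checked by
simulation. The following is a minimal sketch, not part of the original text: the choices
$f(x) = x^2$, Exponential(1) data (so $\mu = \sigma^2 = 1$), the sample size, the number of replications
and the seed are all illustrative assumptions. It compares the empirical variance of
$n^{1/2}(f(\bar X_n) - f(\mu))$ with the limiting value $f'(\mu)^2\sigma^2 = 4$.

```python
import random
import statistics


def delta_method_demo(n=400, reps=2000, seed=0):
    """Simulate n^{1/2} (f(Xbar_n) - f(mu)) for f(x) = x^2 and Exponential(1) data.

    All numerical choices here are illustrative assumptions, not from the text.
    """
    rng = random.Random(seed)
    mu, sigma2 = 1.0, 1.0            # mean and variance of Exponential(1)

    def f(x):
        return x * x

    fprime_mu = 2.0 * mu             # f'(mu) for f(x) = x^2
    draws = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        draws.append(n ** 0.5 * (f(xbar) - f(mu)))
    empirical_var = statistics.variance(draws)
    limiting_var = fprime_mu ** 2 * sigma2   # variance of the normal limit
    return empirical_var, limiting_var


empirical_var, limiting_var = delta_method_demo()
print(empirical_var, limiting_var)
```

For moderate $n$ the two numbers should be close but not equal, since the $o_p(1)$ remainder
and Monte Carlo error are both still present.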

If $f$ takes values in $\mathbb{R}^m$ then the definition of Fréchet derivative is formally the same,
but with the vector $v$ replaced by a linear transformation from $\mathbb{R}^k$ into $\mathbb{R}^m$, given by an
$m \times k$ matrix. The Fréchet differentiability of $f$ is equivalent to that of each of its $m$
component real-valued functions.

Now, let's consider an exponential family of order $m$ in a minimal representation
(2.5.3) with densities $e^{\theta \cdot T(x) - j(\theta)}$, defined for $\theta \in \Theta \subset \mathbb{R}^m$, the interior of the natural
parameter space. Recall Theorem 3.3.17, on MLEs for exponential families, and its proof.
Let $\bar T_n := \frac{1}{n}\sum_{i=1}^n T(X_i)$. The maximum likelihood estimator of $\theta$, when it exists, is
$(\nabla j)^{-1}(\bar T_n)$. Here $\nabla j$ has a continuous derivative which is an $m \times m$ matrix whose entries
are second partial derivatives of $j$. The matrix is nonsingular and so has a nonsingular
inverse, which is the derivative of $(\nabla j)^{-1}$ (e.g. Rudin, 1976, p. 223, (52)). Since $\bar T_n$
is asymptotically normal by the multidimensional central limit theorem (RAP, Theorem
9.5.6), it follows by the delta-method that the MLE is also.
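As a worked instance of this argument (an illustration, not taken from the text), consider
the Poisson family in its natural parametrization with $m = 1$. The densities below are
taken with respect to the measure giving mass $1/x!$ to each nonnegative integer $x$; that
choice of dominating measure is an assumption of the example.

```latex
\begin{align*}
  & e^{\theta x - e^{\theta}}, \quad x = 0, 1, 2, \ldots, \quad \theta \in \mathbb{R},
    \qquad T(x) = x, \quad j(\theta) = e^{\theta}, \\
  & \nabla j(\theta) = e^{\theta} = E_\theta T, \qquad
    (\nabla j)^{-1}(t) = \log t \quad (t > 0), \qquad
    \hat\theta_n = \log \bar T_n \ \text{ when } \bar T_n > 0, \\
  & n^{1/2}\bigl(\bar T_n - e^{\theta}\bigr) \to N(0, e^{\theta}), \qquad
    \frac{d}{dt} \log t \,\Big|_{t = e^{\theta}} = e^{-\theta}, \\
  & n^{1/2}\bigl(\hat\theta_n - \theta\bigr) \to
    N\bigl(0, (e^{-\theta})^2 e^{\theta}\bigr) = N(0, e^{-\theta}).
\end{align*}
```

Here the second derivative $j''(\theta) = e^{\theta}$ is the $1 \times 1$ matrix mentioned above, and its
inverse $e^{-\theta}$ is the derivative of $(\nabla j)^{-1}$ at $e^{\theta}$.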

Next, asymptotic normality will be shown in quite general cases. Let $\Theta$ be an open
subset of $\mathbb{R}^m$, $(X, \mathcal{A}, P)$ a probability space, and $\psi$ a function from $\Theta \times X$ into $\mathbb{R}^m$. Let
$X_1, X_2, \ldots$ be i.i.d. with values in $X$ and distribution $P$. It will be shown under further
assumptions that for a sequence $T_n = T_n(X_1, \ldots, X_n)$ of statistics with values in $\Theta$, if

(3.6.1) $n^{-1/2}\sum_{i=1}^n \psi(T_n, X_i) \to 0$ in probability,

then the distribution of $T_n$, suitably centered and scaled, will converge to some normal law.
Recall that the definition (3.5.1) of approximate M-estimators of $\psi$ type had a further
factor of $n^{-1/2}$ as compared to (3.6.1), although it also assumes a stronger form of
convergence. So it seems a relatively mild condition to assume that the sequence $\{T_n\}$ is
consistent, as is shown under different conditions in Secs. 3.3 and 3.5. Also, asymptotic
normality with mean 0 of $n^{1/2}(T_n - \theta_0)$ implies consistency in probability. At any rate, it
will be an assumption here:

(AN-1) For some $\theta_0 \in \Theta$, $T_n(X_1, \ldots, X_n) \to \theta_0$ in probability as $n \to \infty$.

In the next assumption, that each $\psi(\theta, \cdot)$ is measurable is the same as (B-1) in Sec. 3.5:

(AN-2) For each $\theta \in \Theta$, $\psi(\theta, \cdot)$ is measurable, and $\psi(\cdot, \cdot)$ is separable.

The next condition is similar to (B-3) in Sec. 3.5:

(AN-3) $\lambda(\theta) := E\psi(\theta, x)$ is defined and finite for all $\theta$, and for $\theta_0$ in (AN-1), $\lambda(\theta_0) = 0$.

The further condition in (B-3), that $\lambda(\theta) \ne 0$ for $\theta \ne \theta_0$, is not needed here because
of (AN-1), and for $\theta$ near $\theta_0$, (AN-4)(i) below.

The norm $|\cdot|$ on $\mathbb{R}^m$ will be taken to be $|\theta| := \max(|\theta_1|, \ldots, |\theta_m|)$. Then if $\|\cdot\|$ is
the usual Euclidean norm, we have $|\theta| \le \|\theta\| \le m^{1/2}|\theta|$ for all $\theta$, so (like any two norms
on $\mathbb{R}^m$) the two norms are equivalent within constant multiples.

For the next condition, given any $\theta \in \Theta$, since $\Theta$ is open, for $\delta > 0$ small enough,
$|\phi - \theta| \le \delta$ implies $\phi \in \Theta$. For such a $\delta > 0$ depending on $\theta$, let

$u(\theta, x, \delta) := \sup\{|\psi(\phi, x) - \psi(\theta, x)| : |\phi - \theta| \le \delta\}$.

(This is called the modulus of continuity of $\psi(\cdot, x)$ at $\theta$.) The next assumption is:

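(AN-4) For some numbers $a > 0$, $b > 0$ and $\gamma > 0$,

(i) for $|\theta - \theta_0| \le \gamma$, we have $\theta \in \Theta$ and $|\lambda(\theta)| \ge a|\theta - \theta_0|$;

(ii) $\max(Eu(\theta, x, t), E[u(\theta, x, t)^2]) \le bt$ for any $t > 0$ such that $|\theta - \theta_0| \le \gamma - t$.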
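To see what (AN-4)(ii) asks for in a concrete case, here is a short check for a clipped
location score with $m = 1$. The function $\psi$ and the constant $c$ below are assumptions of
this illustration and are not taken from the text.

```latex
\begin{align*}
  \psi(\theta, x) &:= \max(-c, \min(c, x - \theta)), \qquad c > 0, \\
  |\psi(\phi, x) - \psi(\theta, x)| &\le |\phi - \theta|
     \quad \text{(clipping to $[-c, c]$ is 1-Lipschitz)}, \\
  u(\theta, x, t) &\le t, \qquad E u(\theta, x, t) \le t, \qquad
  E[u(\theta, x, t)^2] \le t^2 \le \gamma t \quad (0 < t \le \gamma).
\end{align*}
```

So for this $\psi$, (AN-4)(ii) holds with $b := \max(1, \gamma)$, whatever the law $P$ is. Part (i)
is a separate requirement on $\lambda$ and does depend on $P$.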

The last assumption is

(AN-5) $E|\psi(\theta_0, x)|^2 < \infty$.

In practice, since $\theta_0$ is unknown, we would need to have a function $\psi(\cdot, \cdot)$ such that
$E|\psi(\theta, x)|^2 < \infty$ for all $\theta$. Since $P$ is also unknown, this means either that we assume $P$
belongs to a family of laws for which the given assumptions hold, or we choose a $\psi$ such
that they hold for all $P$. For (AN-5) this would mean that $\psi(\theta, \cdot)$ is a bounded function
of $x$ for each $\theta$.
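A bounded $\psi$ of this kind can be tried out numerically. The sketch below is an
illustration under stated assumptions, not the text's construction: it uses the clipped
location score $\psi(\theta, x) = \max(-c, \min(c, x - \theta))$ with the conventional constant
$c = 1.345$, solves $\sum_i \psi(\theta, X_i) = 0$ by bisection, and reports the quantity in (3.6.1).
The function names, the contaminated sample and the seed are all hypothetical.

```python
import random


def huber_psi(theta, x, c=1.345):
    """Clipped score psi(theta, x): bounded by c in absolute value for every x."""
    return max(-c, min(c, x - theta))


def huber_m_estimate(xs, c=1.345, tol=1e-10):
    """Solve sum_i psi(theta, x_i) = 0 by bisection.

    The sum is continuous and nonincreasing in theta, nonnegative at min(xs)
    and nonpositive at max(xs), so a root lies in [min(xs), max(xs)].
    """
    lo, hi = min(xs), max(xs)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(huber_psi(mid, x, c) for x in xs) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


rng = random.Random(1)
# Illustrative data: N(0, 1) observations plus three gross outliers.
sample = [rng.gauss(0.0, 1.0) for _ in range(200)] + [50.0, 60.0, 75.0]
t_n = huber_m_estimate(sample)
# n^{-1/2} * sum_i psi(T_n, X_i), the expression required to be small in (3.6.1).
score = sum(huber_psi(t_n, x) for x in sample) / len(sample) ** 0.5
sample_mean = sum(sample) / len(sample)
print(t_n, score, sample_mean)
```

Because $\psi$ is bounded, the three outliers move the estimate far less than they move the
sample mean, and the score in (3.6.1) is zero up to the bisection tolerance.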

On first reading, I advise skipping to Lemma 3.6.13 and its "heuristic proof," then to 
the main Theorem 3.6.15 on asymptotic normality and its short proof. 

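The detailed and rigorous proof (by Huber) is as follows. For any $\tau$ and $\theta$ let
$\eta(\theta, x) := \psi(\theta, x) - \lambda(\theta)$ and

$Z_n(\tau, \theta) := \left|\sum_{i=1}^n [\eta(\tau, X_i) - \eta(\theta, X_i)]\right| / (n^{1/2} + n|\lambda(\tau)|)$.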

Most of the work in proving asymptotic normality will be in the following: 
3.6.2 Lemma. If assumptions (AN-2), (AN-3) and (AN-4) hold for some $\theta_0$, then

$\sup\{Z_n(\tau, \theta_0) : |\tau - \theta_0| \le \gamma\} \to 0$

in probability as $n \to \infty$.

Proof. By an affine change of coordinates, take $\theta_0 = 0$ and $\gamma = 1$ (this may change the
values of $a$, $b$).

The cube $|\tau| \le 1$ will be decomposed into smaller cubes on which $Z_n(\tau, 0)$ will be
bounded separately. Recall that $\lceil x \rceil$ is the smallest integer $\ge x$. Given $\varepsilon > 0$, let

(3.6.3) $M := \lceil \max(2, (3b)/(a\varepsilon)) \rceil$.

Let $q := 1/M$. Let $K$ be a positive integer and consider the cubes centered at the
origin, $C_k := \{\theta : |\theta| \le (1 - q)^k\}$, for $k = 0, 1, \ldots, K$. For $k \ge 1$, decompose
the difference $C_{k-1} \setminus C_k$ into smaller cubes, as shown in Fig. 3.6A, with edges of length
$\ell := \ell_k := (1 - q)^{k-1} q$. Let $d := \ell/2$. Then the centers of the smaller cubes are points
$\xi$ whose coordinates are odd multiples of $d$, with $|\xi| = (1 - q)^{k-1}(1 - q/2)$. (A reason for
using the norm $|\cdot|$ is to make the latter norms $|\xi|$ all equal, as they would not be for other
norms such as the Euclidean norm.)

For each $k \ge 1$, $C_{k-1} \setminus C_k$ is decomposed into fewer than $2m(2M)^{m-1}$ of the smaller
cubes, so the grand total number $N$ of the smaller cubes satisfies $N \le 2^m K m M^{m-1}$. Let
them be numbered $B_1, \ldots, B_N$. $K$ will be chosen depending on $n$; specifically, $K = K(n)$
is the unique integer such that

(3.6.4) $(1 - q)^K \le n^{-3/4} < (1 - q)^{K-1}$,

or equivalently

$K(n) - 1 < (3/4)\log(n)/|\log(1 - q)| \le K(n)$,

which implies

(3.6.5) $N = O(\log n)$ as $n \to \infty$.
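The bookkeeping in (3.6.3) through (3.6.5) can be reproduced numerically. The sketch below
is illustrative only: the function name and the sample values of $a$, $b$, $\varepsilon$, $m$ and $n$ are
assumptions, and $n \ge 2$ is assumed so that $K \ge 1$. It counts the small cubes exactly,
using the fact that each shell $C_{k-1} \setminus C_k$ is a grid of $2M$ small cubes per side with an
inner grid of $2M - 2$ per side removed.

```python
import math


def cube_parameters(a, b, eps, m, n):
    """Return (M, q, K, N) as in the proof of Lemma 3.6.2 (assumes n >= 2)."""
    M = math.ceil(max(2, 3 * b / (a * eps)))          # (3.6.3)
    q = 1.0 / M
    # (3.6.4): K is the least integer with (1 - q)^K <= n^(-3/4).
    K = math.ceil(0.75 * math.log(n) / abs(math.log(1 - q)))
    # Small cubes in one shell: (2M)^m minus the (2M - 2)^m inner ones.
    per_shell = (2 * M) ** m - (2 * M - 2) ** m
    N = K * per_shell
    return M, q, K, N


M, q, K, N = cube_parameters(a=1.0, b=1.0, eps=0.5, m=2, n=10_000)
print(M, K, N)

# K grows like log n, so squaring n roughly doubles K and hence N.
_, _, K_big, N_big = cube_parameters(a=1.0, b=1.0, eps=0.5, m=2, n=10_000 ** 2)
print(K_big / K, N_big / N)
```

With $M$ and $m$ fixed, $N$ is a constant multiple of $K$, which is how (3.6.5) follows from (3.6.4).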
Now,

(3.6.6) $\Pr\{\sup\{Z_n(\tau, 0) : \tau \in C_0\} \ge 2\varepsilon\} \le \Pr(\sup\{Z_n(\tau, 0) : \tau \in C_K\} \ge 2\varepsilon) + \sum_{j=1}^N \Pr(\sup\{Z_n(\tau, 0) : \tau \in B_j\} \ge 2\varepsilon)$.

It will be shown that the right side of (3.6.6) goes to 0 as $n \to \infty$, which will prove the
Lemma.

Consider any of the cubes $B_j$ with center $\xi$ and edges of length $\ell = 2d = (1 - q)^{k-1} q$,
for some $k = 1, \ldots, K$. For any $\tau \in B_j$, we have by (AN-4)(i) $|\lambda(\tau)| \ge a|\tau| \ge a(1 - q)^k$,
and by (AN-4)(ii), since $q \le 1/2$,

(3.6.7) $|\lambda(\tau) - \lambda(\xi)| \le Eu(\xi, x, d) \le bd \le b(1 - q)^k q$.

Next,

$Z_n(\tau, 0) \le Z_n(\tau, \xi) + \left|\sum_{i=1}^n [\eta(\xi, X_i) - \eta(0, X_i)]\right| / (n^{1/2} + n|\lambda(\tau)|)$, so

(3.6.8) $\sup\{Z_n(\tau, 0) : \tau \in B_j\} \le U_n + V_n$

where

$U_n := \sum_{i=1}^n [u(\xi, X_i, d) + Eu(\xi, x, d)] / (na(1 - q)^k)$,
$V_n := \left|\sum_{i=1}^n [\eta(\xi, X_i) - \eta(0, X_i)]\right| / (na(1 - q)^k)$.

[Fig. 3.6A (not reproduced): decomposition of cubes for $M = 2$, proof of Lemma 3.6.2.]

Then

$\Pr(U_n \ge \varepsilon) = \Pr\{\sum_{i=1}^n [u(\xi, X_i, d) - Eu(\xi, x, d)] \ge \varepsilon na(1 - q)^k - 2nEu(\xi, x, d)\}$.

By (3.6.7) and (3.6.3),

$\varepsilon a(1 - q)^k - 2Eu(\xi, x, d) \ge \varepsilon a(1 - q)^k - 2bq(1 - q)^k \ge bq(1 - q)^k$,

which by (AN-4)(ii) and Chebyshev's inequality implies

(3.6.9) $\Pr(U_n \ge \varepsilon) \le 1/[bqn(1 - q)^k]$.
For $V_n$,

$E[(\eta(\xi, \cdot) - \eta(0, \cdot))^2] = \mathrm{var}[\eta(\xi, \cdot) - \eta(0, \cdot)] = \mathrm{var}[\psi(\xi, \cdot) - \psi(0, \cdot)] \le E[u(0, \cdot, |\xi|)^2] \le b(1 - q)^{k-1}$.

Then by Chebyshev's inequality again, since (3.6.3) implies $b/(a^2\varepsilon^2) \le 1/(9bq^2)$, we have

(3.6.10) $\Pr(V_n \ge \varepsilon) \le 1/[9nbq^2(1 - q)^{k+1}]$.
From (3.6.4), (3.6.8), (3.6.9) and (3.6.10) we get

(3.6.11) $\Pr(\sup\{Z_n(\tau, 0) : \tau \in B_j\} \ge 2\varepsilon) \le Jn^{-1/4}$

where $J := [bq(1 - q)]^{-1} + [9bq^2(1 - q)^2]^{-1}$. Thus by (3.6.5), as $n \to \infty$,

$\Pr(\sup\{Z_n(\tau, 0) : \tau \in C_0 \setminus C_K\} \ge 2\varepsilon) \le NJn^{-1/4} = O(n^{-1/4}\log n)$.

Also,

$\sup\{Z_n(\tau, 0) : \tau \in C_K\} \le n^{-1/2}\sum_{i=1}^n [u(0, X_i, \delta) + Eu(0, x, \delta)]$

for $\delta := (1 - q)^K \le n^{-3/4}$ by (3.6.4). So,

$\Pr(\sup\{Z_n(\tau, 0) : \tau \in C_K\} \ge 2\varepsilon) \le \Pr\{\sum_{i=1}^n [u(0, X_i, \delta) - Eu(0, x, \delta)] \ge 2n^{1/2}\varepsilon - 2nEu(0, x, \delta)\}$.

Since by (AN-4)(ii) $Eu(0, x, \delta) \le b\delta \le bn^{-3/4}$, there is an $n_0$ such that

$2n^{1/2}\varepsilon - 2nEu(0, x, \delta) \ge n^{1/2}\varepsilon$ for $n \ge n_0$.

Then by Chebyshev's inequality, since $\mathrm{var}(u(0, x, \delta)) \le E[u(0, x, \delta)^2] \le b\delta \le bn^{-3/4}$,

(3.6.12) $\Pr(\sup\{Z_n(\tau, 0) : \tau \in C_0\} \ge 2\varepsilon) \le O(n^{-3/4}) + O(n^{-1/4}\log n)$,

which gives Lemma 3.6.2. □



The next step toward asymptotic normality is:

3.6.13 Lemma. Assume that (AN-1) through (AN-5) hold and $T_n$ are estimators satisfying
(3.6.1). For the rigorous proof, instead of (AN-1), it suffices to assume that
$\Pr\{|T_n - \theta_0| \le \gamma\} \to 1$ as $n \to \infty$. Then

$n^{1/2}\left(\lambda(T_n) + \frac{1}{n}\sum_{i=1}^n \psi(\theta_0, X_i)\right) \to 0$ in probability.

Notes. As $n \to \infty$, the second term being multiplied by $n^{1/2}$ converges to $\lambda(\theta_0)$, which is
0 by (AN-3). The second term is $O_p(n^{-1/2})$ but not $o_p(n^{-1/2})$ in general, by the central
limit theorem. Thus the Lemma implies that $\lambda(T_n)$ is of the same order. See Problem 1.

(AN-1) states that $T_n \to \theta_0$ in probability, and so $\Pr\{|T_n - \theta_0| \le \gamma\} \to 1$. The proof
will show that from (3.6.1), $T_n \to \theta_0$ at an $O_p(1/\sqrt{n})$ rate.

Heuristic proof of Lemma 3.6.13. Here Lemma 3.6.2 won't be applied. Assumptions
(AN-4) about $E[u(\theta, x, t)^2]$ for $\theta = \theta_0$ and $t = \gamma$ and (AN-5) imply that for all $\theta$ in a
neighborhood $U$ of $\theta_0$, namely $|\theta - \theta_0| \le \gamma$, we have $E(|\psi(\theta, \cdot)|^2) < \infty$. So, we can apply
the multidimensional central limit theorem (RAP, Theorem 9.5.6) to get that for $\theta \in U$,
the distribution of

$n^{-1/2}\sum_{i=1}^n (\psi(\theta, X_i) - \psi(\theta_0, X_i) - \lambda(\theta))$

converges as $n \to \infty$ to a normal distribution $N(0, C)$ where the covariance matrix
$C = C^{\theta, \theta_0}$ depends on $\theta$ and $\theta_0$. (Recall that $\lambda(\theta_0) = 0$.) As $\theta \to \theta_0$,
$E(|\psi(\theta, \cdot) - \psi(\theta_0, \cdot)|^2) \to 0$ by (AN-4)(ii), and so $\mathrm{Var}(\psi_j(\theta, \cdot) - \psi_j(\theta_0, \cdot)) \to 0$ for
$j = 1, \ldots, m$. It follows that the matrix $C^{\theta, \theta_0} \to 0$ as $\theta \to \theta_0$.

Now, substituting $\theta = T_n$ and recalling that as $n \to \infty$, $T_n \to \theta_0$ in probability
by (AN-1), and $n^{-1/2}\sum_{i=1}^n \psi(T_n, X_i) \to 0$ in probability, we are left with
$D_n := n^{-1/2}\sum_{i=1}^n (-\psi(\theta_0, X_i) - \lambda(T_n))$ having asymptotic distribution $N(0, C^{T_n, \theta_0})$, but since
$T_n \to \theta_0$, $D_n \to 0$ in probability. With a minus sign, this gives the conclusion of Lemma
3.6.13.

Rigorous proof of Lemma 3.6.13. As in the proof of Lemma 3.6.2, assume that $\theta_0 = 0$
and $\gamma = 1$. Note that

$\sum_{i=1}^n \psi(T_n, X_i) = \sum_{i=1}^n [\psi(T_n, X_i) - \psi(0, X_i) - \lambda(T_n)] + \sum_{i=1}^n [\psi(0, X_i) + \lambda(T_n)]$.

So, with probability converging to 1,

$\left|\sum_{i=1}^n [\psi(0, X_i) + \lambda(T_n)]\right| / (n^{1/2} + n|\lambda(T_n)|) \le \sup\{Z_n(\tau, 0) : |\tau| \le 1\} + n^{-1/2}\left|\sum_{i=1}^n \psi(T_n, X_i)\right|$.

The terms on the right approach 0 in probability as $n \to \infty$ by Lemma 3.6.2 and (3.6.1),
so the left side does also.

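Given $0 < \varepsilon < 1$, let $V^2 := 2E(|\psi(0, x)|^2)/\varepsilon$, which is finite by (AN-5), and $V \ge 0$.
Then by Chebyshev's inequality, there is an $n_0$ such that for $n \ge n_0$, the inequalities

(3.6.14) $n^{-1/2}\left|\sum_{i=1}^n \psi(0, X_i)\right| \le V$ and

$\left|\sum_{i=1}^n [\psi(0, X_i) + \lambda(T_n)]\right| \le \varepsilon(n^{1/2} + n|\lambda(T_n)|)$

each hold except with probability $\le \varepsilon/2$, so both are true on an event with probability at
least $1 - \varepsilon$. The latter inequality implies

$n^{1/2}|\lambda(T_n)|(1 - \varepsilon) \le \varepsilon + n^{-1/2}\left|\sum_{i=1}^n \psi(0, X_i)\right|$,

so when (3.6.14) holds, $n^{1/2}|\lambda(T_n)| \le (V + \varepsilon)/(1 - \varepsilon)$. So for $n \ge n_0$, with probability at
least $1 - \varepsilon$, we have

$n^{1/2}\left|\lambda(T_n) + \frac{1}{n}\sum_{i=1}^n \psi(0, X_i)\right| \le \varepsilon\left(1 + \frac{V + \varepsilon}{1 - \varepsilon}\right)$.

Letting $\varepsilon \downarrow 0$, $V = O(\varepsilon^{-1/2})$, $\varepsilon V = O(\varepsilon^{1/2})$, and the lemma follows. □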

Since $\lambda(\cdot)$ takes the open set $\Theta \subset \mathbb{R}^m$ into $\mathbb{R}^m$, its Fréchet derivative at $\theta_0$, if it exists,
is a linear transformation $A$ from $\mathbb{R}^m$ into itself, given by an $m \times m$ matrix. Note that
if $A$ exists and is non-singular, then (AN-4)(i) follows. Let $B'$ denote the transpose of a
matrix $B$.

3.6.15 Theorem. Assume (AN-1) through (AN-5), that $T_n$ satisfy (3.6.1), and that $\lambda$ has
a non-singular Fréchet derivative $A$ at $\theta_0$. Then $n^{1/2}(T_n - \theta_0)$ is asymptotically normal
with mean 0 and covariance matrix $A^{-1}C(A^{-1})'$, where $C$ is the covariance matrix of
$\psi(\theta_0, x)$.

Proof. Since $\lambda(\theta_0) = 0$, we have $|\lambda(\theta) - A(\theta - \theta_0)| = o(|\theta - \theta_0|)$ as $|\theta - \theta_0| \to 0$.

By Lemma 3.6.13, the central limit theorem (RAP, Sec. 9.5), and Lemma 11.9.4 of
RAP, as $n \to \infty$, $n^{1/2}\lambda(T_n)$ has distribution converging to $N(0, C)$ (a minus sign doesn't
change this distribution). Thus $|A(T_n - \theta_0)| + o(|T_n - \theta_0|) = O_p(n^{-1/2})$. Since $A$ is
non-singular, $o(|T_n - \theta_0|) = o(|A(T_n - \theta_0)|)$. By (AN-1), $T_n \to \theta_0$ in probability. It follows then
that $|A(T_n - \theta_0)| = O_p(n^{-1/2})$, and $|T_n - \theta_0| = O_p(n^{-1/2})$, so $o(|T_n - \theta_0|) = o_p(n^{-1/2})$,
and $A(\sqrt{n}(T_n - \theta_0)) \to N(0, C)$ in distribution, again using (RAP, Lemma 11.9.4). Thus
by the continuous mapping theorem (RAP, Theorem 9.3.7) applied to $A^{-1}$,

$\sqrt{n}(T_n - \theta_0) \to N(0, C) \circ (A^{-1})^{-1} = N(0, A^{-1}C(A^{-1})')$

(RAP, Proposition 9.5.12). □
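The covariance formula $A^{-1}C(A^{-1})'$ can be compared with a simulation in a simple
one-dimensional case. The sketch below is an illustration under stated assumptions, not
the text's example: it takes $m = 1$, $P = N(0, 1)$, $\theta_0 = 0$ and the clipped score
$\psi(\theta, x) = \max(-c, \min(c, x - \theta))$ with $c = 1.345$, for which $A = \lambda'(0) = -P(|X| < c)$ and
$C = E[\min(X^2, c^2)]$. Function names, sample size, replication count and seed are hypothetical.

```python
import math
import random
import statistics

C_TUNE = 1.345  # clipping constant; a conventional choice, assumed here


def psi(theta, x, c=C_TUNE):
    """Clipped location score psi(theta, x)."""
    return max(-c, min(c, x - theta))


def m_estimate(xs, c=C_TUNE, tol=1e-9):
    """Root of theta -> sum_i psi(theta, x_i), found by bisection."""
    lo, hi = min(xs), max(xs)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(psi(mid, x, c) for x in xs) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def sandwich_variance(c=C_TUNE):
    """A^{-1} C (A^{-1})' for P = N(0, 1), theta_0 = 0, m = 1."""
    Phi = 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))         # normal cdf at c
    phi = math.exp(-0.5 * c * c) / math.sqrt(2.0 * math.pi)  # normal density at c
    A = -(2.0 * Phi - 1.0)                                   # lambda'(0) = -P(|X| < c)
    C = (2.0 * Phi - 1.0) - 2.0 * c * phi + 2.0 * c * c * (1.0 - Phi)  # E psi(0, X)^2
    return C / (A * A)


def simulate(n=100, reps=400, seed=2):
    """Empirical variance of n^{1/2} (T_n - theta_0) over independent samples."""
    rng = random.Random(seed)
    scaled = []
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        scaled.append(math.sqrt(n) * m_estimate(xs))
    return statistics.variance(scaled)


theoretical = sandwich_variance()
empirical = simulate()
print(theoretical, empirical)
```

The two values should agree up to Monte Carlo error and finite-$n$ effects. The sign of $A$
does not matter, since it enters the covariance twice.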



PROBLEMS 

1. Let $\psi(\theta, x) := x - \theta$. Suppose the observations $X_1, X_2, \ldots$ are i.i.d. for a probability
measure $P$ such that $\int x\,dP = \theta_0$. Suppose that in (3.6.1) the estimators $T_n$ are chosen
so that the given expression is 0 rather than just converging to 0. Then evaluate $T_n$
and $\lambda(\cdot)$ explicitly. Show that the expression which converges to 0 in probability in the
conclusion of Lemma 3.6.13 is actually identically 0 in this case.

2. Let $(r, \theta)$ be the usual polar coordinates on $\mathbb{R}^2$. Let $f(x, y) := \sin(2\theta)$ for $r > 0$ and
$f(0, 0) := 0$. Show that the partial derivatives $\partial f/\partial x$ and $\partial f/\partial y$ exist at all points
and are both 0 at $(0, 0)$, but $f$ is not Fréchet differentiable or even continuous at $(0, 0)$.

3. Let $f(x) := x + x^2\sin(1/x^2)$ for $x \ne 0$ and $f(0) := 0$. Show that $f$ is differentiable
at all $x$ and $f'(0) = 1$, but $f'$ is unbounded in every neighborhood of 0, so $f$ is not $C^1$.
Also, $f$ is not one-to-one in any neighborhood of 0.

NOTES 

This section is based on Sec. 4 of Huber (1967). If the function $\lambda(\cdot)$ is $C^1$, and
has a non-singular derivative at $\theta_0$, then a $C^1$ local inverse function $\lambda^{-1}$ exists from a
neighborhood of 0 to a neighborhood of $\theta_0$, by the inverse function theorem, e.g. Rudin
(1976, p. 223, (52)), or Hoffman (1975, Sec. 8.5, Theorem 7), and the delta-method applies
to the inverse function. It turns out to be unnecessary to assume existence of the Fréchet
derivative of $\lambda$ except at $\theta_0$ in Theorem 3.6.15 but, since $\theta_0$ is unknown, in practice we would
need the differentiability everywhere, and in the cases actually encountered in statistics
it seems that $\lambda(\cdot)$ will then be $C^1$. Non-$C^1$ functions having derivatives everywhere exist
even in one dimension (Problem 3) but seem to be rather pathological.